Scalable Group Detection via a Mutual Information Model

Authors

  • Jafar Adibi
  • Hans Chalupsky
Abstract

A major problem in the area of link discovery is uncovering hidden organizational structure such as groups and their members [5]. The group detection task can be further divided into (1) discovering hidden members of known groups (group extension) and (2) identifying completely unknown groups. Adibi et al. [1] describe the KOJAK Group Finder (GF) system, which uses a novel mutual information (MI) approach combined with logic-based reasoning to find hidden groups and group members in large evidence databases. In this paper we report on the wider applicability and scalability of the GF by applying it to a variety of synthetic datasets that contain up to 7,500,000 links.

The GF detects groups in four phases. (1) A logic-based group seed generator analyzes the evidence and outputs a set of seed groups using deductive and abductive reasoning. (2) An MI model finds likely new candidates for each group, producing an extended group. (3) The MI model is used to rank these likely members by how strongly connected they are to the seed members. (4) The ranked extended group is pruned using a threshold to produce the final output.

After phase 1 has completed and seed groups have been generated from the available evidence, the GF tries to identify additional members by looking for people who are strongly connected with one or more of the seed members. To decide whether two entities are strongly connected, we aggregate the known links between them and statistically contrast them with connections to other candidates and to the general population. This is done by an MI model that exploits evidence such as individuals sharing an attribute (e.g., their address) or being involved in the same activity (e.g., communicating via email). These attributes and actions are represented as random variables, and we measure connection strength by measuring the MI between them. If the variables (or entities) are independent, the MI between them is zero. If they are strongly dependent, the MI between them is large. MI between X and Y is defined as:
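
The excerpt ends before the definition is given. For two discrete random variables, the standard textbook definition is I(X;Y) = sum over x, y of p(x,y) * log[ p(x,y) / (p(x) p(y)) ], which is zero exactly when X and Y are independent. The sketch below is a minimal illustration of that formula together with the rank-and-prune idea of phases 3 and 4. It rests on assumptions that are not stated in the abstract: each entity is encoded as a hypothetical 0/1 sequence recording whether it took part in each observed event, probabilities are estimated by simple counting, and the names `mutual_information` and `rank_candidates` are invented for illustration. It is not the GF implementation.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical MI (in bits) between two equal-length discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts for X
    count_y = Counter(ys)         # marginal counts for Y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_candidates(seed, candidates, threshold):
    """Score candidates by MI with a seed member, sort, and prune.

    `seed` and each value in `candidates` are hypothetical 0/1 event
    participation sequences; `threshold` is an assumed cut-off in bits.
    """
    scores = {name: mutual_information(seed, seq)
              for name, seq in candidates.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= threshold]


if __name__ == "__main__":
    seed_member = [1, 1, 0, 0]
    others = {
        "always_together": [1, 1, 0, 0],  # fully dependent on the seed
        "unrelated": [1, 0, 1, 0],        # statistically independent
    }
    print(rank_candidates(seed_member, others, threshold=0.5))
```

Note that MI measures dependence in either direction, so an entity that systematically avoids the seed member's events would also score high under this naive encoding; a practical system would need to handle that case separately.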

Similar resources

Measuring Statistical Dependence via the Mutual Information Dimension

We propose to measure statistical dependence between two random variables by the mutual information dimension (MID), and present a scalable parameter-free estimation method for this task. Supported by sound dimension theory, our method gives an effective solution to the problem of detecting interesting relationships of variables in massive data, which is nowadays a heavily studied topic in many...

Mutual Information-based Intrusion Detection Model for Industrial Internet

High dimensionality, redundant attributes, and high computing cost are common issues in the field of industrial Internet intrusion detection. To solve these problems, a mutual information-based intrusion detection model for the industrial Internet was proposed. First, by using a feature selection method based on mutual information, the attribute set was reduced and traffic characteristics vector ...

Authorization models for secure information sharing: a survey and research agenda

This article presents a survey of authorization models and considers their 'fitness-for-purpose' in facilitating information sharing. Network-supported information sharing is an important technical capability that underpins collaboration in support of dynamic and unpredictable activities such as emergency response, national security, infrastructure protection, supply chain integration and emerg...

Intelligent scalable image watermarking robust against progressive DWT-based compression using genetic algorithms

Image watermarking refers to the process of embedding an authentication message, called watermark, into the host image to uniquely identify the ownership. In this paper a novel, intelligent, scalable, robust wavelet-based watermarking approach is proposed. The proposed approach employs a genetic algorithm to find nearly optimal positions to insert watermark. The embedding positions coded as chr...

A Saliency Detection Model via Fusing Extracted Low-level and High-level Features from an Image

Salient regions attract more human attention than other regions in an image. Low-level and high-level features are utilized in saliency region detection. Low-level features contain primitive information such as color or texture while high-level features usually consider visual systems. Recently, some salient region detection methods have been proposed based on only low-level features or hig...

Publication date: 2004